Retrieval of video and vehicle behavior for a driving scene described in search text

ABSTRACT

The retrieval device extracts a feature corresponding to a search text by inputting the search text into a pre-trained text feature extraction model. For each of plural combinations stored in a database, each associating a text description including plural sentences with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, the retrieval device then computes a text distance represented by a difference between a feature extracted from each sentence of the text description associated with the video and vehicle behavior data, and the feature corresponding to the search text. According to the text distances, the retrieval device outputs as the search result a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance.

CROSS-REFERENCE TO RELATED APPLICATION

This application is based on and claims priority under 35 USC 119 from Japanese Patent Application No. 2019-138287 filed on Jul. 26, 2019, the disclosure of which is incorporated by reference herein.

BACKGROUND

Technical Field

Technology disclosed herein relates to a retrieval device, a training device, a retrieval system, a retrieval program, and a training program.

Related Art

Japanese Patent Application Laid-Open (JP-A) No. 2019-95878 discloses technology in which a driver's driving behavior data is used as a query, and driving behavior data similar to the query is extracted and output.

Further, “Weakly Supervised Video Moment Retrieval from Text Queries” (N. C. Mithun et al., CVPR2019) and “TALL: Temporal Activity Localization via Language Query” (J. Gao et al., ICCV2017) disclose technology that uses a search text query to retrieve video similar to the query.

When there is a desire to search vehicle behavior data representing temporal vehicle behavior, searching for vehicle behavior data using a search text as a query, as with a general search engine, is preferable to using vehicle behavior data itself as a query. Moreover, in addition to the vehicle behavior data, it is also preferable to retrieve video data (for example, vehicle-view video data) that corresponds to such vehicle behavior data.

However, the technology of JP-A No. 2019-95878 requires driving behavior data corresponding to vehicle behavior data to be input as a query. Moreover, the results output by the technology of JP-A No. 2019-95878 are merely driving behavior data.

By contrast, the technologies of “Weakly Supervised Video Moment Retrieval from Text Queries” (N. C. Mithun et al., CVPR2019) (hereafter referred to as Non-Patent Document 1) and “TALL: Temporal Activity Localization via Language Query” (J. Gao et al., ICCV2017) (hereafter referred to as Non-Patent Document 2) retrieve video using search text as a query. However, these technologies are not capable of retrieving vehicle behavior data.

Thus, employing the related technologies does not enable the retrieval of video and vehicle behavior data pairs corresponding to a driving scene described in search text.

SUMMARY

A retrieval device according to a first aspect includes a memory and a processor coupled to the memory, the processor being configured to: acquire a search text; extract a feature corresponding to the search text by inputting the search text to a text feature extraction model configured to extract features from input sentences, the text feature extraction model being pre-trained so as to reduce a loss represented by a difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and also being pre-trained so as to reduce a loss represented by a difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior; compute a text distance for each of a plurality of combinations stored in the memory, each combination associating a text description including a plurality of sentences with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, the text distance being represented by a difference between a feature extracted from each sentence of the text description associated with the video and the vehicle behavior data, and the feature corresponding to the search text; and output, as a search result, a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, in accordance with the text distances.

A training device according to a first aspect includes a memory and a processor coupled to the memory, the processor being configured, for each of a plurality of training data items associating a text description including a plurality of sentences with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, to: extract a feature of a sentence of the training data by inputting the sentence to a text feature extraction model configured to extract features from input sentences, extract a feature of a video corresponding to the same training data by inputting the video to a video feature extraction model configured to extract features from input video, and compute a first loss represented by a difference between the sentence feature and the video feature; extract a feature of a sentence of the training data by inputting the sentence to the text feature extraction model, extract a feature of vehicle behavior data corresponding to the same training data by inputting the vehicle behavior data to a vehicle behavior feature extraction model configured to extract features from input vehicle behavior data, and compute a second loss represented by a difference between the sentence feature and the vehicle behavior data feature; compute an overall loss function unifying the first loss with the second loss; train the text feature extraction model and the video feature extraction model so as to reduce the overall loss function; train the text feature extraction model and the vehicle behavior feature extraction model so as to reduce the overall loss function; and obtain a pre-trained text feature extraction model by causing the training processing to be performed repeatedly until the overall loss function becomes smaller than a prescribed threshold.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic block diagram of a retrieval system according to an exemplary embodiment.

FIG. 2 is an explanatory diagram to explain an example of training data of the exemplary embodiment.

FIG. 3 is an explanatory diagram to explain models of the exemplary embodiment.

FIG. 4 is an explanatory diagram to explain models of the exemplary embodiment.

FIG. 5 is a diagram illustrating an example of search results displayed on a display device.

FIG. 6 is a diagram illustrating an example of a configuration of a computer to implement the respective devices configuring a retrieval system.

FIG. 7 is a diagram illustrating an example of training processing executed by a training device according to the exemplary embodiment.

FIG. 8 is a diagram illustrating an example of retrieval processing executed by a retrieval device according to the exemplary embodiment.

DETAILED DESCRIPTION

Exemplary Embodiment

Explanation follows regarding a retrieval system of an exemplary embodiment, with reference to the drawings.

FIG. 1 is a block diagram illustrating an example of a configuration of a retrieval system 10 according to the present exemplary embodiment. As illustrated in FIG. 1, the retrieval system 10 includes a training device 12, a retrieval device 14, and a display device 15. The training device 12 and the retrieval device 14 are connected together by a prescribed method of communication.

Training Device 12

The training device 12 includes a database 16, a pre-trained model storage section 18, a first loss computation section 20, a second loss computation section 22, a unifying section 24, a first training section 26, a second training section 28, and a model acquisition section 30. The first loss computation section 20 and the second loss computation section 22 are examples of computation sections of technology disclosed herein.

The database 16 is stored with plural items of training data in which a text description including plural sentences, a vehicle-view video, and vehicle behavior data representing temporal vehicle behavior are stored in association with each other. Note that the vehicle behavior data may also be referred to as driving operation data representing temporal vehicle driving operation.

For example, as illustrated in FIG. 2, the database 16 is stored with text descriptions, videos, and vehicle behavior data stored in association with each other. The videos are videos captured by a vehicle-mounted camera. The vehicle behavior data is vehicle behavior data obtained when such videos were being captured. The videos and the vehicle behavior data are thus data acquired at the same time instants as each other. The text description is writing to describe the video and the vehicle behavior data, and includes plural sentences. The respective sentences in the text descriptions describe the driving scene of the video and vehicle behavior data.
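
For illustration only, one record of such a database could be represented as in the following sketch; the field names and array shapes are assumptions for explanation and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class TrainingItem:
    """One database record: a multi-sentence text description, the
    vehicle-view video it describes, and the vehicle behavior data
    recorded while that video was captured."""
    sentences: List[str]       # plural sentences of the text description
    video_frames: np.ndarray   # (T_video, H, W, 3) frames from the vehicle-mounted camera
    behavior: np.ndarray       # (T_behavior, M) readings from M vehicle sensors
```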

In the present exemplary embodiment, text descriptions associated with the videos and vehicle behavior data are used to generate respective models to retrieve video and vehicle behavior data from search text.

The pre-trained model storage section 18 is stored with a text feature extraction model 31, a video feature extraction model 32, a first mapping model 33, a vehicle behavior feature extraction model 34, and a second mapping model 35.

The text feature extraction model 31 extracts features from input sentences. The video feature extraction model 32 extracts features from input video. The vehicle behavior feature extraction model 34 extracts features from input vehicle behavior data. The first mapping model 33 and the second mapping model 35 will be described later.

As illustrated in FIG. 3, the video feature extraction model 32 is configured including an image feature extraction model 32A, a first matching model 32B, and a first output model 32C. As illustrated in FIG. 4, the vehicle behavior feature extraction model 34 is configured including a temporal feature extraction model 34A, a second matching model 34B, and a second output model 34C. Functionality of each of these models will be described later.

The first loss computation section 20 extracts sentence features by inputting the training data sentences to the text feature extraction model 31 for each of the plural items of training data stored in the database 16. The first loss computation section 20 extracts video features by inputting the video corresponding to the same training data to the video feature extraction model 32. The first loss computation section 20 also computes a first loss represented by a difference between the sentence features and the video features.

Specifically, the first loss computation section 20 first reads each of the plural items of training data stored in the database 16. In the following, the processing performed on a single item of training data will be described.

Next, the first loss computation section 20 inputs each of the plural sentences in the text description of the training data item to the text feature extraction model 31 stored in the pre-trained model storage section 18, and extracts plural sentence features. Specifically, the first loss computation section 20 extracts a feature w_(j)^(i) of the j^(th) sentence in the text description by inputting the text feature extraction model 31 with the j^(th) sentence of the text description associated with the i^(th) video in the training data.

Note that an auto-encoder built with a recurrent neural network (for example, an LSTM or GRU) is employed in the text feature extraction model 31 to extract sentence features. For example, a hidden vector of the encoder or decoder of the auto-encoder is employed as the sentence feature. In the present exemplary embodiment, a feature is extracted for each sentence of the text description, and the resulting feature obtained for the j^(th) sentence for the i^(th) video is denoted by w_(j)^(i).
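
As a minimal sketch, the encoder half of such a recurrent auto-encoder could look as follows in PyTorch; the class name and dimensions are illustrative assumptions, and the decoder used during auto-encoder training is omitted, since the encoder's final hidden state can serve as the sentence feature w_(j)^(i).

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encoder half of a recurrent auto-encoder for sentence features."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, sentence_length) integer token indices
        embedded = self.embed(token_ids)
        _, (hidden, _) = self.rnn(embedded)  # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)             # one feature vector per sentence
```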

Next, the first loss computation section 20 extracts a video feature by inputting the video feature extraction model 32 stored in the pre-trained model storage section 18 with the video belonging to the same training data item as the sentences that were input to the text feature extraction model 31.

The video feature is extracted using the image feature extraction model 32A, the first matching model 32B, and the first output model 32C, as illustrated in FIG. 3. Specific explanation follows regarding how the video feature is extracted.

First, the first loss computation section 20 extracts individual features v_(k)^(i) for the frame images at time instants k in the i^(th) video by inputting the image feature extraction model 32A with the frame images at time instants k in the i^(th) video of the training data.

The image feature extraction model 32A is configured by a convolutional neural network, a recurrent neural network, or the like. Note that the feature of the frame image at time instant k in the i^(th) video in the training data is denoted by v_(k)^(i).

Note that the output from an intermediate layer of a pre-trained model may be employed to extract features from frame images, as in “C3D” of Reference Document 1 below or “VGG16” of Reference Document 2 below.

-   Reference Document 1: “Learning Spatiotemporal Features with 3D Convolutional Networks” (D. Tran et al., ICCV, pages 4489 to 4497, 2015)
-   Reference Document 2: “Very Deep Convolutional Networks for Large-Scale Image Recognition” (K. Simonyan and A. Zisserman, arXiv:1409.1556, 2014)

Next, the first loss computation section 20 inputs the first matching model 32B with combinations of the features v_(k)^(i) of the frame images at time instants k in the i^(th) video, as extracted by the image feature extraction model 32A, combined with the features w_(j)^(i) of the j^(th) sentences of the text description for the i^(th) video, as extracted by the text feature extraction model 31. The first matching model 32B calculates similarities s_(jk)^(i) between the frame images at time instants k in the i^(th) video and the j^(th) sentences of the text description. The first matching model 32B also calculates, as matching results, weighting coefficients a_(jk)^(i) in accordance with the similarities s_(jk)^(i).

The first matching model 32B matches frame images in the video against the text description to quantify the degree of matching therebetween. The features v_(k)^(i) of the frame images in the video and the features w_(j)^(i) of each sentence of the text description are employed for such matching.

Note that to perform matching, the features v_(k)^(i) of the frame images and the features w_(j)^(i) of each sentence of the text description need to have the same dimensionality. Accordingly, in cases in which the dimensionality of the features v_(k)^(i) of the frame images differs from the dimensionality of the features w_(j)^(i) of each sentence of the text description, processing is, for example, performed to align the dimensionality of the features v_(k)^(i) of the frame images with the dimensionality of the features w_(j)^(i) of each sentence of the text description. For example, as required, additional architecture taking input of the features v_(k)^(i) of the frame images is added to the first matching model 32B so as to obtain frame image features v⁻_(k)^(i) of the same dimensionality as the features w_(j)^(i) of each sentence of the text description. The additional architecture may be a single-level or multi-level configuration including a fully-connected layer, a convolution layer, a pooling layer, an activation function, dropout, or the like. Note that, for example, the matching processing of the first matching model 32B performs quantification by employing cosine similarity between the features v⁻_(k)^(i) of the frame images and the features w_(j)^(i) of each sentence of the text description (see, for example, Non-Patent Document 1).
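
A sketch of this alignment and cosine-similarity step might read as follows; the single projection layer and the feature dimensionalities (2048 for frames, 256 for sentences) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Single fully-connected layer aligning frame features with sentence features.
project = nn.Linear(2048, 256)

def frame_sentence_similarity(v: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """v: (K, 2048) frame features v_k; w: (J, 256) sentence features w_j.
    Returns s: (J, K) cosine similarities s_jk."""
    v_bar = project(v)                              # v-bar_k, now 256-dimensional
    return F.cosine_similarity(w.unsqueeze(1),      # (J, 1, 256)
                               v_bar.unsqueeze(0),  # (1, K, 256)
                               dim=-1)              # broadcast to (J, K)
```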

The first matching model 32B also uses the similarities s_(jk)^(i) to compute the weighting coefficients a_(jk)^(i) for the similarities between the frame images at time instants k in the i^(th) video and the j^(th) sentences of the text description. For example, a method employing the softmax of the similarities s_(jk)^(i) may be used therefor (see, for example, Non-Patent Document 1).

Next, the first loss computation section 20 acquires features f_(j)^(i) of the i^(th) video by inputting the first output model 32C with a combination of the weighting coefficients a_(jk)^(i), which are the matching result for the i^(th) video of the training data output from the first matching model 32B, combined with the features v_(k)^(i) of the frame images at time instants k in the i^(th) video as extracted by the image feature extraction model 32A.

The feature f_(j)^(i) of the i^(th) video corresponding to the j^(th) sentence in the text description is computed by the first output model 32C by employing the features v_(k)^(i) of the frame images and the weighting coefficients a_(jk)^(i). For example, as in Equation (1) below, the feature f_(j)^(i) of the i^(th) video is computed using a linear coupling in which the features v_(k)^(i) of the frame images are weighted using the weighting coefficients a_(jk)^(i) (see, for example, Non-Patent Document 1).

f_(j)^(i) = Σ_(k) a_(jk)^(i) v_(k)^(i)   Equation (1)
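
A compact sketch of this weighting and pooling step follows, continuing the illustrative tensor shapes of the previous sketch.

```python
import torch

def attention_pool(s: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """s: (J, K) similarities s_jk; v: (K, D) frame features v_k.
    Returns f: (J, D) with f_j = sum_k a_jk * v_k (Equation (1))."""
    a = torch.softmax(s, dim=1)  # weighting coefficients a_jk, summing to 1 over k
    return a @ v                 # linear coupling of the frame features
```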

Next, the first loss computation section 20 inputs the first mapping model 33 with a combination of the feature f_(j)^(i) of the i^(th) video of the training data, as output from the first output model 32C, combined with the feature w_(j)^(i) of the j^(th) sentence in the text description for the i^(th) video, as output from the text feature extraction model 31, so as to acquire a revamped video feature F_(j)^(i) corresponding to the video feature f_(j)^(i) and a revamped sentence feature W_(j)^(i) corresponding to the sentence feature w_(j)^(i).

The first mapping model 33 is a model to map plural different features into the same joint space. The video features f_(j)^(i) and the sentence features w_(j)^(i) are embedded in a space of the same dimensionality as each other by the first mapping model 33 so as to obtain revamped features F_(j)^(i) for the video features f_(j)^(i) and revamped features W_(j)^(i) for the sentence features w_(j)^(i). Examples of embedding methods that may be employed include linear mapping (see, for example, Non-Patent Document 1), or employing any two freely selected functions that give one mapping the same dimensionality as the other mapping.
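
With linear mapping as the embedding method, the first mapping model could be sketched as two linear layers into a shared space; the 512-dimensional joint space and the input sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two linear maps embedding video features f_j and sentence features w_j
# into the same joint space, yielding the revamped features F_j and W_j.
map_video = nn.Linear(2048, 512)  # f_j -> F_j
map_text = nn.Linear(256, 512)    # w_j -> W_j

f = torch.randn(4, 2048)          # example video features for J = 4 sentences
w = torch.randn(4, 256)           # example sentence features
F_joint, W_joint = map_video(f), map_text(w)  # both (4, 512), directly comparable
```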

The revamped feature F_(j)^(i) of the i^(th) video of the training data and the revamped feature W_(j)^(i) of the j^(th) sentence in the text description describing the i^(th) video are thereby obtained.

Next, the first loss computation section 20 computes a first loss represented by a difference between the revamped video feature F_(j)^(i) and the revamped sentence feature W_(j)^(i).

A loss function L_(VT) employed as the first loss may, for example, employ the video-text loss (see, for example, Non-Patent Document 1). However, the loss function L_(VT) is not limited thereto, and may employ any freely selected function expressed by L_(VT) = Σ_(i,j) l_(VT)(i,j), representing the sum of losses l_(VT)(i,j) between the i^(th) video and the j^(th) sentence in the text description.
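
As one admissible instance of such a freely selected function, a squared-error form could be sketched as follows; this is a simplifying assumption consistent with "a difference between features", not the ranking-style video-text loss of Non-Patent Document 1.

```python
import torch

def first_loss(F_joint: torch.Tensor, W_joint: torch.Tensor) -> torch.Tensor:
    """L_VT = sum over (i, j) of l_VT(i, j), with l_VT taken here to be the
    squared distance between matched revamped features F_j and W_j."""
    return ((F_joint - W_joint) ** 2).sum()
```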

The second loss computation section 22 extracts sentence features for each of the plural training data items stored in the database 16 by inputting the training data sentences to the text feature extraction model 31, and extracts vehicle behavior data features by inputting the vehicle behavior data corresponding to the same training data to the vehicle behavior feature extraction model 34. The second loss computation section 22 then computes a second loss represented by a difference between the sentence features and the vehicle behavior data features.

Specifically, the second loss computation section 22 first reads each of the plural training data items stored in the database 16. The following explanation describes the processing performed on a single item of training data.

First, the second loss computation section 22 extracts a vehicle behavior data feature by inputting the vehicle behavior feature extraction model 34 stored in the pre-trained model storage section 18 with the vehicle behavior data belonging to the same training data item as that already employed in the text feature extraction model 31.

The vehicle behavior data feature is extracted using the temporal feature extraction model 34A, the second matching model 34B, and the second output model 34C illustrated in FIG. 4. Specific explanation follows regarding extraction of the vehicle behavior data features.

First, the second loss computation section 22 extracts a vehicle behavior feature c_(l)^(i) at time instant l for the i^(th) vehicle behavior data by inputting the temporal feature extraction model 34A with the behavior at time instant l in the vehicle behavior data associated with the i^(th) video in the training data.

It is assumed here that the start time and end time have been specified in advance for the vehicle behavior data associated with the i^(th) video of the training data. Typically, features are extracted so as to include the period from the start time to the end time of the video; however, there is no limitation thereto.

Specifically, the second loss computation section 22 first divides the vehicle behavior data into windows [l, l+W] for time instants l, based on a window width W specified in advance by a user. Next, the second loss computation section 22 employs an auto-encoder built using a recurrent neural network (for example, an LSTM or GRU) to extract features from the vehicle behavior corresponding to each window. For example, an embedded vector or hidden vector of the encoder or decoder of the auto-encoder may be employed as the features. The vehicle behavior features c_(l)^(i) at time instants l are thereby extracted from the i^(th) vehicle behavior data.
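
The windowing step might be sketched as follows; the function name is illustrative, and each returned window would then be encoded by the recurrent auto-encoder to yield the feature c_(l)^(i).

```python
import numpy as np

def behavior_windows(behavior: np.ndarray, W: int) -> list:
    """behavior: (T, M) sensor readings; W: user-specified window width.
    Returns the windows [l, l+W], one per valid time instant l."""
    return [behavior[l:l + W + 1] for l in range(len(behavior) - W)]
```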

Next, the second loss computation section 22 inputs the second matching model 34B with a combination of the vehicle behavior feature c_(l)^(i) at time instant l in the vehicle behavior data, as output from the temporal feature extraction model 34A, combined with the sentence feature w_(j)^(i) extracted using the text feature extraction model 31. The second loss computation section 22 thereby calculates a similarity u_(jl)^(i) between the vehicle behavior at time instant l in the vehicle behavior data associated with the i^(th) video and the j^(th) sentence. The second loss computation section 22 also calculates, as a matching result, a weighting coefficient b_(jl)^(i) in accordance with the similarity u_(jl)^(i).

The second matching model 34B matches vehicle behavior data against text descriptions to quantify a degree of matching. The vehicle behavior features c_(l)^(i) at time instants l in the vehicle behavior data and the features w_(j)^(i) of each sentence of the text description, as extracted by the first loss computation section 20, are employed in such matching.

Note that when matching, the dimensionalities of the vehicle behavior features c_(l)^(i) and the features w_(j)^(i) of each sentence of the text description need to be the same. Accordingly, when the dimensionality of the vehicle behavior features c_(l)^(i) differs from that of the features w_(j)^(i) of each sentence of the text description, processing is, for example, performed to align the dimensionality of the vehicle behavior features c_(l)^(i) with the dimensionality of the features w_(j)^(i) of each sentence of the text description. For example, as required, additional architecture taking input of the vehicle behavior features c_(l)^(i) is added to the second matching model 34B so as to obtain vehicle behavior features c⁻_(l)^(i) of the same dimensionality as the features w_(j)^(i) of each sentence of the text description. The additional architecture may be a single-level or multi-level configuration including a fully-connected layer, a convolution layer, a pooling layer, an activation function, dropout, or the like. Note that, for example, the matching processing by the second matching model 34B performs quantification by employing cosine similarity between the vehicle behavior features c⁻_(l)^(i) and the features w_(j)^(i) of each sentence of the text description (see, for example, Non-Patent Document 1).

In the following explanation, the similarity between the j^(th) sentence in the text description associated with the i^(th) video and the vehicle behavior in the window [l, l+W] of the vehicle behavior data is denoted by u_(jl)^(i).

The second matching model 34B computes, from the similarity u_(jl)^(i), the weighting coefficient b_(jl)^(i) of the similarity between the vehicle behavior at time instant l associated with the i^(th) video and the j^(th) sentence in the text description. For example, a method that calculates the softmax of the similarities u_(jl)^(i) is employed (see, for example, Non-Patent Document 1).

Next, the second loss computation section 22 acquires a vehicle behavior data feature g_(j)^(i) by inputting the second output model 34C with a combination of the weighting coefficient b_(jl)^(i) output from the second matching model 34B, combined with the vehicle behavior feature c_(l)^(i) at time instant l in the vehicle behavior data associated with the i^(th) video, as extracted by the temporal feature extraction model 34A.

The second output model 34C employs the vehicle behavior features c_(l)^(i) and the weighting coefficients b_(jl)^(i) to compute the vehicle behavior data feature g_(j)^(i) for the i^(th) video and the j^(th) sentence in the text description. For example, as in Equation (2) below, the vehicle behavior data feature g_(j)^(i) for the j^(th) sentence is computed using a linear coupling of the vehicle behavior features c_(l)^(i) weighted using the weighting coefficients b_(jl)^(i) (see, for example, Non-Patent Document 1).

g_(j)^(i) = Σ_(l) b_(jl)^(i) c_(l)^(i)   Equation (2)

Next, the second loss computation section 22 acquires a revamped vehicle behavior data feature G_(j)^(i) corresponding to the vehicle behavior data feature g_(j)^(i) and a revamped sentence feature W^(˜)_(j)^(i) corresponding to the sentence feature w_(j)^(i) by inputting the second mapping model 35 with a combination of the vehicle behavior data feature g_(j)^(i), as output from the second output model 34C, combined with the feature w_(j)^(i) of the j^(th) sentence corresponding to the i^(th) video, as extracted by the text feature extraction model 31.

The second mapping model 35 is a model to map plural different features into the same joint space. The vehicle behavior data features g_(j)^(i) and the sentence features w_(j)^(i) are embedded into a space of the same dimensionality by the second mapping model 35, thereby obtaining the revamped features G_(j)^(i) for the vehicle behavior data features g_(j)^(i) and the revamped features W^(˜)_(j)^(i) for the sentence features w_(j)^(i). Examples of the embedding method employed include linear mapping (see, for example, Non-Patent Document 1), or employing any two freely selected functions that give one mapping the same dimensionality as the other mapping. Note that the embedding dimension employed here may be the same as the dimension of the embedding by the first loss computation section 20, or may be different.

The revamped feature G_(j)^(i) of the i^(th) vehicle behavior data in the training data and the revamped feature W^(˜)_(j)^(i) of the j^(th) sentence in the text description describing the vehicle behavior data associated with the i^(th) video are obtained thereby.

Next, the second loss computation section 22 computes a second loss represented by a difference between the revamped vehicle behavior data feature G_(j)^(i) and the revamped sentence feature W^(˜)_(j)^(i). The revamped features are features embedded into the joint space.

A loss function L_(CT) employed as the second loss may, for example, take the same form as the video-text loss (see, for example, Non-Patent Document 1). However, the loss function L_(CT) is not limited thereto, and may employ any freely selected function expressed by L_(CT) = Σ_(i,j) l_(CT)(i,j), representing the sum of losses l_(CT)(i,j) between the vehicle behavior data associated with the i^(th) video and the j^(th) sentence in the text description.

The unifying section 24 computes an overall loss function unifying the first loss L_(VT) computed by the first loss computation section 20 and the second loss L_(CT) computed by the second loss computation section 22.

For example, as expressed by Equation (3) below, the unifying section 24 computes an overall loss function L by performing a linear coupling of the first loss L_(VT), computed in training across the videos and the text descriptions, with the second loss L_(CT), computed in training across the vehicle behavior data and the text descriptions. Note that λ in the following Equation is a user-specified hyperparameter.

L = L_(VT) + λL_(CT), λ ∈ ℝ   Equation (3)

The first training section 26 trains the text feature extraction model 31, the video feature extraction model 32, and the first mapping model 33 so as to reduce the overall loss function L computed by the unifying section 24. Specifically, the first training section 26 updates the respective parameters of the text feature extraction model 31, the video feature extraction model 32, and the first mapping model 33 so as to reduce the overall loss function L. Each of the models, including the text feature extraction model 31, is thereby trained so as to reduce the loss represented by a difference between the features extracted from the sentences by the text feature extraction model 31 and the features extracted from videos correctly matched to the sentences.

The first training section 26 then updates the text feature extraction model 31 and each of the models included in the video feature extraction model 32 stored in the pre-trained model storage section 18.

The second training section 28 trains the text feature extraction model 31, the vehicle behavior feature extraction model 34, and the second mapping model 35 so as to reduce the overall loss function L computed by the unifying section 24. Specifically, the second training section 28 updates the respective parameters of the text feature extraction model 31, the vehicle behavior feature extraction model 34, and the second mapping model 35 so as to reduce the overall loss function L. Each of the models, including the text feature extraction model 31, is thereby trained so as to reduce the loss represented by a difference between the features extracted from the sentences by the text feature extraction model 31 and the features extracted from vehicle behavior data correctly matched to the sentences.

The second training section 28 then updates the text feature extraction model 31 and each of the models included in the vehicle behavior feature extraction model 34 stored in the pre-trained model storage section 18.

For example, the first training section 26 and the second training section 28 update the respective parameters using a mini-batch method. A stochastic optimization method such as stochastic gradient descent (SGD), Adam, AdaGrad, or RMSprop may be employed to update the respective model parameters.
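
One mini-batch update could be sketched as follows; the placeholder models, the learning rate, and the choice of Adam are illustrative assumptions (any of the optimizers named above would do), and l_vt and l_ct denote the first and second losses computed for the batch.

```python
import torch
import torch.nn as nn

def train_step(optimizer: torch.optim.Optimizer,
               l_vt: torch.Tensor, l_ct: torch.Tensor, lam: float) -> float:
    """Combine the two losses via Equation (3) and update every parameter
    handed to the optimizer in one step."""
    loss = l_vt + lam * l_ct  # overall loss L = L_VT + lambda * L_CT
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative wiring: in practice the optimizer would receive the parameters
# of the text, video, vehicle behavior, and both mapping models.
text_model, video_model = nn.Linear(8, 4), nn.Linear(8, 4)  # placeholders
params = list(text_model.parameters()) + list(video_model.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
```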

The model acquisition section 30 causes the training processing of the first training section 26 and the training processing of the second training section 28 to be repeated until the overall loss function L computed by the unifying section 24 becomes smaller than a prescribed threshold.

The model acquisition section 30 acquires each pre-trained model when the overall loss function L has become smaller than the prescribed threshold ε. The model acquisition section 30 then stores each pre-trained model in the pre-trained model storage section 18 and updates the respective models.

Note that the text feature extraction model 31, trained by both the first training section 26 and the second training section 28, is trained so as to reduce the loss represented by a difference between the features extracted from the sentences and the features extracted from videos correctly matched to the sentences, and is also trained so as to reduce the loss represented by a difference between the features extracted from the sentences and the features extracted from the vehicle behavior data correctly matched to the sentences.

Accordingly, in the retrieval device 14 described below, employing this text feature extraction model to retrieve video and vehicle behavior data enables video and vehicle behavior data appropriately described by a search text to be retrieved from that search text.

Retrieval Device 14

The retrieval device 14 includes a database 40, a pre-trained model storage section 42, an acquisition section 44, a text feature extraction section 46, a text distance computation section 48, and a search result output section 49.

The database 40 is stored with the same data as the database 16 of the training device 12.

The pre-trained model storage section 42 is stored with the same models as each of the models stored in the pre-trained model storage section 18 of the training device 12.

The acquisition section 44 acquires a search text q input by a user. The search text q is composed of sentences used to retrieve a vehicle-view video and the vehicle behavior data associated with that video.

The text feature extraction section 46 inputs the search text q acquired by the acquisition section 44 to the text feature extraction model 31 stored in the pre-trained model storage section 42. The text feature extraction section 46 then extracts the features output from the text feature extraction model 31 corresponding to the search text q.

In the present exemplary embodiment, the search text q is expressed as q=(q₁, q₂), wherein q₁ is a sentence corresponding to a video, and q₂ is a sentence corresponding to vehicle behavior data.

Specifically, the text feature extraction section 46 first identifies in the search text q a first sentence q₁, this being a sentence representing a video, and a second sentence q₂, this being a sentence representing vehicle behavior data. In the present exemplary embodiment, an example will be described in which two sentences are included in the search text q, the first thereof being the first sentence q₁ and the second thereof being the second sentence q₂.

Next, the text feature extraction section 46 extracts a feature Q₁ of the first sentence q₁ by inputting the first sentence q₁ to the text feature extraction model 31. The text feature extraction section 46 also extracts a feature Q₂ of the second sentence q₂ by inputting the second sentence q₂ to the text feature extraction model 31.

Next, the text feature extraction section 46 employs the respective models stored in the pre-trained model storage section 42 to extract, for each of the plural training data items stored in the database 40, features from each sentence of the text description associated with the video and the vehicle behavior data.

Note that the embedded feature of the j₁^(th) sentence of a text description for the i^(th) video in the training data is denoted W_(j1)^(i). The embedded feature of the j₂^(th) sentence of a text description for the vehicle behavior data associated with the i^(th) video in the training data is denoted W^(˜)_(j2)^(i).

In the present exemplary embodiment, an example will be described in which the features W_(j1)^(i) and W^(˜)_(j2)^(i) are extracted by the text feature extraction section 46 of the retrieval device 14; however, the features W_(j1)^(i) and W^(˜)_(j2)^(i) extracted by the training device 12 may also be employed therefor.

The text distance computation section 48 computes a text distance representing a difference between the features W_(j1)^(i) and W^(˜)_(j2)^(i) extracted by the text feature extraction section 46 from each of the sentences of the text descriptions in the plural training data items, and the features Q₁ and Q₂ corresponding to the search text, as extracted by the text feature extraction section 46.

Specifically, the text distance computation section 48 uses Equation (4) below to compute the difference between the feature Q₁ of the first sentence q₁ and the feature W_(j1)^(i) of the j₁^(th) sentence in the text description associated with the i^(th) video stored in the database 40.

∥Q₁ − W_(j1)^(i)∥   Equation (4)

The text distance computation section 48 also uses Equation (5) below to compute the difference between the feature Q₂ of the second sentence q₂ and the feature W^(˜)_(j2)^(i) of the j₂^(th) sentence in the text description associated with the i^(th) video stored in the database 40.

∥Q₂ − W^(˜)_(j2)^(i)∥   Equation (5)

Note that ∥⋅∥ denotes the norm of a vector; for example, an L2 norm or an L1 norm may be employed therefor. Further, v>0 is a parameter specified in advance by a user.

The text distance computation section 48 then computes, as the text distance, the value expressed by Equation (6) below, this being a weighted sum of the difference computed using Equation (4) and the difference computed using Equation (5).

∥Q₁ − W_(j1)^(i)∥ + v∥Q₂ − W^(˜)_(j2)^(i)∥   Equation (6)
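
A direct transcription of Equations (4) to (6) might read as follows, using the L2 norm (one of the admissible choices above); the function name is illustrative.

```python
import numpy as np

def text_distance(Q1: np.ndarray, Q2: np.ndarray,
                  W1: np.ndarray, W2_tilde: np.ndarray, v: float) -> float:
    """Equation (6): weighted sum of the norm differences of Equations (4)
    and (5); v > 0 is the user-specified weighting parameter."""
    return float(np.linalg.norm(Q1 - W1) + v * np.linalg.norm(Q2 - W2_tilde))
```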

Note that the text distance is computed for each sentence of the text description of each training data item.

According to the text distances computed by the text distance computation section 48, the search result output section 49 identifies, using Equation (7) below, a prescribed number N of videos i^((n)) in sequence from the smallest text distance, together with the two sentences j₁^((n)) and j₂^((n)) included in the text descriptions associated with each of these videos. Note that i^((n)) represents an index of videos in the training data, and j₁^((n)) and j₂^((n)) represent indices of sentences included in the text descriptions.

{(i^((n)), j₁^((n)), j₂^((n)))}_(n=1)^(N) = argmin_(i,j₁,j₂)^((N)) {∥Q₁ − W_(j1)^(i)∥ + v∥Q₂ − W^(˜)_(j2)^(i)∥}   Equation (7)

Equation (8) below denotes a function that returns the collection of triplets (i, j₁, j₂) obtained when a target function f(i, j₁, j₂) is taken N times in sequence from its smallest value.

argmin_(i,j₁,j₂)^((N)) f(i, j₁, j₂)   Equation (8)

Equation (7) above is used to identify the N videos i^((n)) in sequence from the smallest text distance and the sentences j₁^((n)) and j₂^((n)) of the text description associated with each of these videos.
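
The selection of Equations (7) and (8) amounts to keeping the N smallest-scoring triplets, which could be sketched as follows (the function name is illustrative).

```python
import heapq

def top_n_triplets(candidates, score, N):
    """candidates: iterable of triplets (i, j1, j2); score: the target
    function f(i, j1, j2), here the text distance of Equation (6).
    Returns the N triplets with the smallest scores, in ascending order."""
    return heapq.nsmallest(N, candidates, key=lambda t: score(*t))
```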

Moreover, for each n^(th) pair (wherein 1≤n≤N) out of the N video and vehicle behavior data pairs, the search result output section 49 identifies the frame images of a segment [k_(s)^((n)), k_(e)^((n))] for which a weighting coefficient a^(i)_(j1(n)k) is larger than a threshold δ₁, based on the weighting coefficient a^(i)_(j1(n)k) in accordance with the similarity s^(i)_(jk) between a feature of the j₁^((n)th) sentence in the text description associated with the i^(th) video corresponding to the n^(th) pair and a feature of the frame image at time instant k in the i^(th) video. Note that the weighting coefficients a^(i)_(j1(n)k) for the respective training data are calculated in advance by the training device 12 and stored in the database 40.

Specifically, the search result output section 49 takes a maximum-length time segment K^((n))=[k_(s)^((n)), k_(e)^((n))] of consecutive time instants k satisfying a^(i)_(j1(n)k)>δ₁, for a weighting coefficient threshold δ₁ (0<δ₁<1) specified in advance by the user, as the video time band corresponding to the j₁^((n)th) sentence in the text description.

Moreover, for each n^(th) pair (1≤n≤N) out of the N video and vehicle behavior data pairs, the search result output section 49 identifies the vehicle behavior of a segment [l_(s)^((n)), l_(e)^((n))] having a weighting coefficient b^(i)_(j2(n)l) larger than a threshold δ₂, based on the weighting coefficient b^(i)_(j2(n)l) in accordance with the similarity u^(i)_(j2(n)l) between a feature of the j₂^((n)th) sentence in the text description associated with the vehicle behavior data corresponding to the i^(th) video corresponding to the n^(th) pair, and a vehicle behavior feature at time instant l in the vehicle behavior data corresponding to the i^(th) video. Note that the weighting coefficients b^(i)_(j2(n)l) for each training data item are calculated in advance by the training device 12 and stored in the database 40.

Specifically, as the vehicle behavior data time band corresponding to the j₂^((n)th) sentence in the text description, the search result output section 49 takes a maximum-length time segment L^((n))=[l_(s)^((n)), l_(e)^((n))] of consecutive time instants l satisfying b^(i)_(j2(n)l)>δ₂, for a weighting coefficient threshold δ₂ (0<δ₂<1) specified in advance by the user.
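
Both segment selections follow the same rule: find the maximum-length run of consecutive time instants whose weighting coefficient exceeds the threshold. A sketch follows (function name illustrative), applied with δ₁ and the coefficients a for video, or δ₂ and the coefficients b for vehicle behavior.

```python
def longest_segment(weights, delta):
    """weights: sequence of weighting coefficients indexed by time instant;
    delta: threshold in (0, 1). Returns the inclusive bounds (start, end)
    of the longest run with weights[t] > delta, or None if none qualifies."""
    best, start = None, None
    for t, wt in enumerate(list(weights) + [float("-inf")]):  # sentinel closes a final run
        if wt > delta:
            if start is None:
                start = t            # a run begins
        elif start is not None:      # a run of length t - start just ended
            if best is None or (t - start) > (best[1] - best[0] + 1):
                best = (start, t - 1)
            start = None
    return best
```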

The search result output section 49 then outputs, as a search result, pairings of the video time segments [k_(s)^((n)), k_(e)^((n))] and the vehicle behavior data time segments [l_(s)^((n)), l_(e)^((n))].

For example, the search result output section 49 employs the time segment K^((n))=[k_(s)^((n)), k_(e)^((n))] of the video i^((n)) corresponding to the search text q acquired by the acquisition section 44, and the time segment L^((n))=[l_(s)^((n)), l_(e)^((n))] of the vehicle behavior data, to display video and vehicle behavior data on the display device 15. A pair of video and vehicle behavior data corresponding to the search text q is thereby obtained.

The search result output from the search result output section 49 is passed to the display device 15. For example, the display device 15 displays search results as illustrated in FIG. 5, in which the video and vehicle behavior data pairs are ranked for display.

The example illustrated in FIG. 5 is an example in which the search text “Traffic ahead of the car is stopped. The car is stopped.” has been input to the retrieval device 14 as a query. In this case, “Traffic ahead of the car is stopped.” is identified as the first sentence q₁, and “The car is stopped.” is identified as the second sentence q₂. Videos described by the first sentence q₁ and vehicle behavior data described by the second sentence q₂ are searched for, and N items are output as search results in sequence from the smallest text distance. Note that sensor 1 . . . sensor M of the vehicle behavior data represent vehicle behavior data obtained by different sensors.

The training device 12 and the retrieval device 14 may, for example, each be implemented by a computer 50 such as that illustrated in FIG. 6. The computer 50 includes a CPU 51, memory 52 serving as a temporary storage region, and a non-volatile storage section 53. The computer 50 further includes an input/output interface (I/F) 54 for connecting input/output devices (not illustrated in the drawings) and the like, and a read/write (R/W) section 55 to control reading and writing of data with respect to a recording medium 59. The computer 50 further includes a network I/F 56 connected to a network such as the internet. The CPU 51, the memory 52, the storage section 53, the input/output I/F 54, the R/W section 55, and the network I/F 56 are connected together through a bus 57.

The storage section 53 may be implemented by a hard disk drive (HDD), a solid state drive (SSD), flash memory, or the like. The storage section 53 serves as a storage medium and is stored with a program to cause the computer 50 to function. The CPU 51 reads the program from the storage section 53, expands the program into the memory 52, and sequentially executes the processes in the program.

Next, explanation follows regarding operation of the retrieval system 10 of the present exemplary embodiment.

Plural training data items are stored in the database 16 of the training device 12. When the training device 12 receives a signal instructing training processing, the training device 12 executes a training processing routine as illustrated in FIG. 7.

At step S100, the first loss computation section 20 acquires the plural training data items stored in the database 16. The second loss computation section 22 also acquires the plural training data items stored in the database 16.

At step S102, the first loss computation section 20 acquires the video features f_(j)^(i) for the respective training data items acquired at step S100 by inputting the videos to the video feature extraction model 32, and acquires the sentence features w_(j)^(i) by inputting each of the sentences of the text descriptions to the text feature extraction model 31. The first loss computation section 20 also acquires the revamped video features F_(j)^(i) and the revamped sentence features W_(j)^(i) by inputting the video features f_(j)^(i) and the sentence features w_(j)^(i) to the first mapping model 33, and then computes the first loss L_(VT) represented by the difference between the revamped video features F_(j)^(i) and the revamped sentence features W_(j)^(i).

At step S104, the second loss computation section 22 acquires the vehicle behavior data features g_(j)^(i) for each of the training data items acquired at step S100 by inputting the vehicle behavior data to the vehicle behavior feature extraction model 34. The second loss computation section 22 also acquires the revamped vehicle behavior data features G_(j)^(i) and the revamped sentence features W^(˜)_(j)^(i) by inputting the vehicle behavior data features g_(j)^(i) and the sentence features w_(j)^(i) to the second mapping model 35, and then computes the second loss L_(CT) represented by the difference between the revamped vehicle behavior data features G_(j)^(i) and the revamped sentence features W^(˜)_(j)^(i).

At step S106, the unifying section 24 uses Equation (3) to compute the overall loss function L unifying the first loss L_(VT) computed at step S102 with the second loss L_(CT) computed at step S104.

At step S108, the model acquisition section 30 determines whether or not the overall loss function L computed at step S106 is the prescribed threshold ε or greater. Processing transitions to step S110 in cases in which the overall loss function L is the prescribed threshold ε or greater. The training processing routine is ended in cases in which the overall loss function L is smaller than the prescribed threshold ε.

At step S110, the first training section 26 trains the text feature extraction model 31 and the video feature extraction model 32 so as to reduce the overall loss function L computed at step S106.

At step S112, the second training section 28 trains the text feature extraction model 31 and the vehicle behavior feature extraction model 34 so as to reduce the overall loss function L computed at step S106.

At step S114, the first training section 26 updates the text feature extraction model 31 and the respective models included in the video feature extraction model 32 stored in the pre-trained model storage section 18. The second training section 28 also updates the text feature extraction model 31 and the respective models included in the vehicle behavior feature extraction model 34 stored in the pre-trained model storage section 18.

When training of each model by the training device 12 has been completed, each of these models is stored in the pre-trained model storage section 42 of the retrieval device 14. The respective values computed by the training device 12 and the plural training data items are stored in the database 40 of the retrieval device 14.

When a user inputs a search text q, the retrieval device 14 executes a retrieval processing routine as illustrated in FIG. 8.

At step S200, the acquisition section 44 acquires the search text q input by the user.

At step S202, the text feature extraction section 46 inputs the first sentence q₁, this being the sentence describing the video in the search text q acquired at step S200, to the text feature extraction model 31 stored in the pre-trained model storage section 42, and extracts the feature Q₁ of the first sentence q₁. The text feature extraction section 46 also inputs the text feature extraction model 31 stored in the pre-trained model storage section 42 with the second sentence q₂, this being the sentence describing the vehicle behavior data in the search text q acquired at step S200, and extracts the feature Q₂ of the second sentence q₂.

At step S204, the text feature extraction section 46 uses the respective models stored in the pre-trained model storage section 42 on each of the plural training data items stored in the database 40 to extract a feature from each sentence of the text description associated with the video and the vehicle behavior data.

At step S206, the text distance computation section 48 computes, for each of the plural training data items stored in the database 40, the text distance represented by the difference between the features (for example, W_(j1)^(i) and W^(˜)_(j2)^(i)) of each sentence of the text descriptions of the plural training data items extracted at step S204 and the features Q₁, Q₂ corresponding to the search text extracted at step S202.

At step S208, the search result output section 49 uses Equation (7) to identify the N videos i^((n)) in sequence from the smallest text distance, according to the text distances computed at step S206, and the two sentences j₁^((n)) and j₂^((n)) in the text description associated with each of these videos.

At step S210, for each n^(th) pair (1≤n≤N) out of the N video and vehicle behavior data pairs, the search result output section 49 outputs as a search result the segment K^((n))=[k_(s)^((n)), k_(e)^((n))] in the videos selected at step S208 and the segment L^((n))=[l_(s)^((n)), l_(e)^((n))] of the vehicle behavior data associated with these videos, and then ends the retrieval processing routine.

As described above, the retrieval device 14 according to the present exemplary embodiment extracts a feature corresponding to a search text by inputting the search text to the text feature extraction model 31, which has been pre-trained so as to reduce the loss represented by the difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and also so as to reduce the loss represented by the difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior. Moreover, for each of the plural combinations stored in the database, each associating a text description including plural sentences with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, the retrieval device 14 computes a text distance represented by a difference between a feature extracted from each sentence of the text description associated with the video and the vehicle behavior data, and the feature corresponding to the search text. The retrieval device 14 also, according to the text distances, outputs as search results a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance. This enables appropriate video and vehicle behavior data pairs to be retrieved that correspond to a driving scene described by the search text.

Moreover, for each of the plural training data items associating a text description including plural sentences, a vehicle-view video, and vehicle behavior data representing temporal vehicle behavior with each other, the training device 12 according to the present exemplary embodiment extracts sentence features by inputting the sentences of the training data to the text feature extraction model 31, extracts video features by inputting the video corresponding to the same training data to the video feature extraction model 32, and computes the first loss represented by the difference between the sentence feature and the video feature. The training device 12 also extracts vehicle behavior data features by inputting the vehicle behavior data corresponding to the same training data to the vehicle behavior feature extraction model 34, and computes the second loss represented by the difference between the sentence feature and the vehicle behavior data feature. Next, the training device 12 computes the overall loss function L unifying the first loss with the second loss. The training device 12 trains the text feature extraction model 31 and the video feature extraction model 32 so as to reduce the overall loss function, and likewise trains the text feature extraction model 31 and the vehicle behavior feature extraction model 34 so as to reduce the overall loss function. The training device 12 then obtains the pre-trained text feature extraction model 31 by causing the training processing of the first training section and the training processing of the second training section to be repeated until the overall loss function is smaller than the prescribed threshold. This enables a text feature extraction model 31 to be obtained that retrieves appropriate video and vehicle behavior data pairs corresponding to the driving scene described by the search text. Note that in the training method of the training device 12, which considers both the video and the vehicle behavior data, the text feature extraction model 31 and the vehicle behavior feature extraction model 34 both rely on sentence feature extraction; there is therefore a need to perform the video training and the vehicle behavior data training in parallel.

Note that although a case has been described in which the processing performed by the respective devices in the exemplary embodiment described above is software processing performed by executing a program, the processing may be performed by hardware. Alternatively, the processing may be performed by a combination of both software and hardware. Moreover, the program stored in the ROM may be stored on various storage media for distribution.

Technology disclosed herein is not limited to the above, and obviously various other modifications may be implemented within a range not departing from the spirit thereof.

For example, any type of model may be employed as the respective models. For example, the respective models illustrated in FIG. 3 and the respective models illustrated in FIG. 4 may be configured from a single-level or multi-level configuration including fully-connected layers, convolution layers, pooling layers, activation functions, dropout, and the like.

In the exemplary embodiment described above, explanation has been given regarding an example in which the video and the vehicle behavior data are each separately mapped into an embedded space together with the text description to find the first loss and the second loss, respectively. However, there is no limitation thereto. For example, the video and the vehicle behavior data may each be mapped into the same embedded space so as to compute a loss therein.

In the exemplary embodiment described above, explanation has been given regarding an example in which the output as the search result is pairs of the segments K^((n))=[k_(s)^((n)), k_(e)^((n))] in the video paired with the segments L^((n))=[l_(s)^((n)), l_(e)^((n))] in the vehicle behavior data associated with this video. However, there is no limitation thereto. For example, just the video and vehicle behavior data pairs may be output as the search result.

Moreover, configuration may be adopted in which a number n* (n*<N) of pairs preset by a user is output as the search results when outputting pairs configured by the segments K^((n))=[k_(s)^((n)), k_(e)^((n))] of the video and the segments L^((n))=[l_(s)^((n)), l_(e)^((n))] of the vehicle behavior data associated with this video.

In consideration of the above circumstances, an object of technology disclosed herein is to provide a retrieval device, a training device, a retrieval system, a retrieval program, and a training program capable of retrieving video and vehicle behavior data pairs corresponding to a driving scene described in search text.

Solution to Problem

A retrieval device according to a first aspect includes an acquisition section, a text feature extraction section, a computation section, and a search result output section. The acquisition section is configured to acquire a search text. The text feature extraction section is configured to extract a feature corresponding to the search text acquired by the acquisition section by inputting the search text to a text feature extraction model configured to extract features from input sentences. The text feature extraction model is pre-trained so as to reduce a loss represented by a difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and is also pre-trained so as to reduce a loss represented by a difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior. The computation section is configured to compute a text distance for each of plural combinations stored in a database, each combination associating a text description including plural sentences with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior. The text distance is represented by a difference between a feature extracted from each sentence of the text description associated with the video and vehicle behavior data, and the feature corresponding to the search text. The search result output section is configured to output as a search result a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, according to the text distances computed by the computation section.

In the retrieval device according to the first aspect, a feature corresponding to the search text is extracted by inputting the search text to the text feature extraction model pre-trained so as to reduce a loss represented by the difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and pre-trained so as to reduce a loss represented by the difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior. The retrieval device then outputs, as the search result, the prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, according to the text distance represented by the difference between a feature extracted from each sentence of the text description associated with the video and vehicle behavior data, and the feature corresponding to the search text. This enables retrieval of video and vehicle behavior data pairs corresponding to a driving scene described by the search text.
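
A minimal sketch of this retrieval step, under assumptions the aspect leaves open: `text_model` stands in for the pre-trained text feature extraction model, the database is a list of (sentence features, video, vehicle behavior data) combinations, the text distance is Euclidean, and each combination is scored by its best-matching sentence.

```python
import torch

# Hypothetical retrieval step; names and the min-over-sentences scoring
# are assumptions, not part of the disclosure.
def retrieve(search_text, text_model, db, num_results):
    q = text_model(search_text)  # feature corresponding to the search text
    scored = []
    for sent_feats, video, behavior in db:
        # Text distance of the query to each sentence of the description;
        # the best-matching sentence represents this combination.
        d = min(torch.dist(q, w).item() for w in sent_feats)
        scored.append((d, video, behavior))
    scored.sort(key=lambda x: x[0])  # smallest text distance first
    return [(v, b) for _, v, b in scored[:num_results]]
```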

A retrieval device of a second aspect has the following configuration. The text feature extraction section therein is configured to extract a feature $Q_1$ of a first sentence $q_1$, which is a sentence in the search text that describes a video, by inputting the first sentence $q_1$ to the text feature extraction model, and to extract a feature $Q_2$ of a second sentence $q_2$, which is a sentence in the search text that describes vehicle behavior data, by inputting the second sentence $q_2$ to the text feature extraction model. The computation section therein is configured to compute the text distance, for each of plural training data items stored in a database, according to a difference between the feature $Q_1$ of the first sentence $q_1$ and a feature $W_{j_1}^{i}$ of a $j_1$-th sentence of the text description associated with an $i$-th video, and also according to a difference between the feature $Q_2$ of the second sentence $q_2$ and a feature $\tilde{W}_{j_2}^{i}$ of a $j_2$-th sentence of the text description associated with the $i$-th video stored in the database. The search result output section therein is configured to output, as a search result, $N$ video and vehicle behavior data pairs in sequence from the smallest text distance. This enables retrieval of video and vehicle behavior data pairs that correspond to a driving scene described by the search text, in consideration of the sentence describing the video and the sentence describing the vehicle behavior that are included in the search text.
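
One plausible formalization of this two-part text distance (an assumption; the aspect only requires that both differences contribute):

```latex
% Possible text distance for combination i, minimizing over sentence
% indices; the norm and the sum are assumptions.
\[
  d_i \;=\; \min_{j_1} \bigl\lVert Q_1 - W^{i}_{j_1} \bigr\rVert
      \;+\; \min_{j_2} \bigl\lVert Q_2 - \tilde{W}^{i}_{j_2} \bigr\rVert
\]
```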

In a retrieval device of a third aspect, for each $n$-th ($1 \le n \le N$) pair included in the $N$ video and vehicle behavior data pairs, the search result output section is configured to output, as the search result, a pair of: frame images of a segment $[k_s^{(n)}, k_e^{(n)}]$ for which a weighting coefficient $a_{j_1(n)k}^{i}$ is larger than a threshold $\delta_1$, the weighting coefficient $a_{j_1(n)k}^{i}$ being in accordance with a similarity $s_{jk}^{i}$ between a feature of the $j_1(n)$-th sentence in the text description associated with the $i$-th video corresponding to the $n$-th pair and a feature of a frame image at time instant $k$ in the $i$-th video; and a vehicle behavior of a segment $[l_s^{(n)}, l_e^{(n)}]$ for which a weighting coefficient $b_{j_2(n)l}^{i}$ is larger than a threshold $\delta_2$, the weighting coefficient $b_{j_2(n)l}^{i}$ being in accordance with a similarity $u_{j_2(n)l}^{i}$ between a feature of the $j_2(n)$-th sentence in the text description associated with the vehicle behavior data corresponding to the $i$-th video corresponding to the $n$-th pair and a feature of a vehicle behavior at time instant $l$ in the vehicle behavior data corresponding to the $i$-th video. This enables the driving scene described by the search text to be appropriately presented from among the video and vehicle behavior data pairs.
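
A minimal sketch of thresholding the weighting coefficients to recover a segment, assuming the weights for one sentence are given as a tensor over time instants and that the instants exceeding the threshold form one contiguous span:

```python
import torch

# Illustrative only: keep the span of time instants whose weighting
# coefficient exceeds the threshold delta, returning [start, end].
def segment_over_threshold(weights: torch.Tensor, delta: float):
    # weights: (T,) weighting coefficients a (video) or b (behavior)
    idx = (weights > delta).nonzero(as_tuple=True)[0]
    if idx.numel() == 0:
        return None  # no time instant exceeds the threshold
    return int(idx.min()), int(idx.max())

a = torch.tensor([0.01, 0.20, 0.35, 0.30, 0.02])
print(segment_over_threshold(a, delta=0.15))  # -> (1, 3)
```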

In a retrieval device of a fourth aspect, the search result output section is configured to output, as the search result, $n^*$ pairs out of the $N$ video and vehicle behavior data pairs, where the number $n^*$ has been preset by a user. This enables the number of search results desired by a user to be appropriately presented.

A training device according to a fifth aspect includes a first loss computation section, a second loss computation section, a unifying section, a first training section, a second training section, and a model acquisition section. For each of plural training data items associating a text description including plural sentences, with a vehicle-view video, and with vehicle behavior data representing temporal vehicle behavior, the first loss computation section is configured to extract a feature of a sentence of the training data by inputting the sentence to a text feature extraction model configured to extract features from input sentences, to extract a feature of the video corresponding to the same training data by inputting the video to a video feature extraction model configured to extract features from input video, and to compute a first loss represented by a difference between the sentence feature and the video feature. The second loss computation section is configured to extract a feature of a sentence of the training data by inputting the sentence to the text feature extraction model, to extract a feature of the vehicle behavior data corresponding to the same training data by inputting the vehicle behavior data to a vehicle behavior feature extraction model configured to extract features from input vehicle behavior data, and to compute a second loss represented by a difference between the sentence feature and the vehicle behavior data feature. The unifying section is configured to compute an overall loss function unifying the first loss with the second loss. The first training section is configured to train the text feature extraction model and the video feature extraction model so as to reduce the overall loss function computed by the unifying section. The second training section is configured to train the text feature extraction model and the vehicle behavior feature extraction model so as to reduce the overall loss function computed by the unifying section. The model acquisition section is configured to obtain a pre-trained text feature extraction model by causing the training processing by the first training section and the training processing by the second training section to be performed repeatedly until the overall loss function computed by the unifying section becomes smaller than a prescribed threshold.

In the training device according to the fifth aspect, the first loss represented by the difference between the sentence feature and the video feature is computed, the second loss represented by the difference between the sentence feature and the vehicle behavior data feature is computed, and the overall loss function unifying the first loss with the second loss is computed. The training device then trains the text feature extraction model and the video feature extraction model so as to reduce the overall loss function, and also trains the text feature extraction model and the vehicle behavior feature extraction model so as to reduce the overall loss function. In this way, because the text feature extraction model is generated in consideration of the relationships of the text description to both the video and the vehicle behavior data, a text feature extraction model is obtained that retrieves appropriate video and vehicle behavior data pairs corresponding to a driving scene described by the search text.
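
A minimal training-loop sketch under assumptions the aspect leaves open: the overall loss unifies the two losses by summation, Adam is the optimizer, and repetition stops once the average overall loss falls below the prescribed threshold. `first_loss_fn` and `second_loss_fn` are hypothetical callables closing over the respective models.

```python
import torch

# Hypothetical training loop for the fifth aspect; loss functions,
# optimizer, and stopping rule are assumptions.
def train(params, batches, first_loss_fn, second_loss_fn,
          threshold=1e-3, lr=1e-4, max_epochs=100):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(max_epochs):
        running = 0.0
        for batch in batches:
            loss = first_loss_fn(batch) + second_loss_fn(batch)  # overall loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            running += loss.item()
        if running / len(batches) < threshold:  # prescribed threshold reached
            break
```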

A training device according to a sixth aspect is configured as follows. The first loss computation section therein is configured to acquire a revamped sentence feature and a revamped video feature mapped into a same joint space by inputting the sentence feature extracted by the text feature extraction model and the video feature extracted by the video feature extraction model to a first mapping model configured to map plural different features into the same joint space, and is also configured to compute a first loss represented by a difference between the revamped sentence feature and the revamped video feature. The second loss computation section therein is configured to acquire a revamped sentence feature and a revamped vehicle behavior data feature mapped into a same joint space by inputting the sentence feature extracted by the text feature extraction model and the vehicle behavior data feature extracted by the vehicle behavior feature extraction model to a second mapping model configured to map plural different features into the same joint space, and is also configured to compute a second loss represented by a difference between the revamped sentence feature and the revamped vehicle behavior data feature. The differences between features can be computed because the sentence features and the video features are mapped into the same joint space, and the sentence features and the vehicle behavior data features are mapped into the same joint space. This in turn enables the text feature extraction model to be trained appropriately.

A training device according to a seventh aspect is configured as follows. The video feature extraction model includes an image feature extraction model configured to extract features from images, a first matching model configured to match sentence features against image features, and a first output model configured to output video features based on matching results output from the first matching model and the image features. The vehicle behavior feature extraction model includes a temporal feature extraction model configured to extract features from the vehicle behavior at each time instant of the vehicle behavior data, a second matching model configured to match sentence features against vehicle behavior features, and a second output model configured to output vehicle behavior data features based on matching results output from the second matching model and the vehicle behavior features. For each of the plural training data items, the first loss computation section is configured to: extract a feature $v_k^i$ of a frame image at time instant $k$ of an $i$-th video of the training data by inputting the frame image at the time instant $k$ in the $i$-th video to the image feature extraction model; extract a feature $w_j^i$ of the $j$-th sentence of the text description associated with the $i$-th video of the training data by inputting the $j$-th sentence in the text description to the text feature extraction model; calculate a similarity $s_{jk}^i$ between the frame image at the time instant $k$ in the $i$-th video and the $j$-th sentence in the text description by inputting, to the first matching model, a combination of the feature $v_k^i$ of the frame image at the time instant $k$ in the $i$-th video of the training data and the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video, and also calculate a weighting coefficient $a_{jk}^i$ in accordance with the similarity $s_{jk}^i$ as a matching result; acquire a video feature $f_j^i$ for the $j$-th sentence of the $i$-th video by inputting, to the first output model, a combination of the weighting coefficient $a_{jk}^i$ that is the matching result for the $i$-th video of the training data and the feature $v_k^i$ of the frame image at the time instant $k$ in the $i$-th video; acquire a revamped video feature $F_j^i$ corresponding to the feature $f_j^i$ of the $i$-th video of the training data and a revamped sentence feature $W_j^i$ corresponding to the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video by inputting, to the first mapping model, a combination of the video feature $f_j^i$ and the sentence feature $w_j^i$; and compute a first loss represented by a difference between the revamped video feature $F_j^i$ and the revamped sentence feature $W_j^i$.
For each of the plural training data items, the second loss computation section is configured to: extract a feature $c_l^i$ of the vehicle behavior at time instant $l$ in the $i$-th vehicle behavior data associated with the $i$-th video of the training data by inputting the behavior at the time instant $l$ in the vehicle behavior data to the vehicle behavior feature extraction model; calculate a similarity $u_{jl}^i$ between the vehicle behavior at the time instant $l$ in the vehicle behavior data associated with the $i$-th video and the $j$-th sentence in the text description by inputting, to the second matching model, a combination of the feature $c_l^i$ of the behavior at the time instant $l$ in the vehicle behavior data associated with the $i$-th video of the training data and the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video, and calculate a weighting coefficient $b_{jl}^i$ in accordance with the similarity $u_{jl}^i$ as a matching result; acquire a feature $g_j^i$ of the vehicle behavior data by inputting, to the second output model, plural combinations of the weighting coefficient $b_{jl}^i$ that is the matching result for the vehicle behavior data associated with the $i$-th video of the training data and the feature $c_l^i$ of the vehicle behavior at the time instant $l$ for the vehicle behavior data associated with the $i$-th video; acquire a revamped vehicle behavior data feature $G_j^i$ corresponding to the feature $g_j^i$ for the $j$-th sentence of the vehicle behavior data associated with the $i$-th video of the training data and a revamped sentence feature $\tilde{W}_j^i$ corresponding to the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video by inputting, to the second mapping model, a combination of the vehicle behavior data feature $g_j^i$ and the sentence feature $w_j^i$; and compute a second loss represented by a difference between the revamped vehicle behavior data feature $G_j^i$ and the revamped sentence feature $\tilde{W}_j^i$.
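
A minimal sketch of the matching-and-pooling pattern shared by both branches, assuming a dot-product similarity and softmax weighting (the aspect only requires weights "in accordance with" the similarity); the same function stands in for $(s_{jk}^i, a_{jk}^i, f_j^i)$ on the video side and $(u_{jl}^i, b_{jl}^i, g_j^i)$ on the vehicle behavior side.

```python
import torch
import torch.nn.functional as F

# Hypothetical matching/output step: similarity, weighting coefficients,
# and weighted pooling over time instants.
def attend(sentence_feat: torch.Tensor, time_feats: torch.Tensor):
    # sentence_feat: (d,); time_feats: (T, d) - frame image features v_k
    # (video branch) or vehicle behavior features c_l (behavior branch).
    sim = time_feats @ sentence_feat   # similarities s_jk or u_jl, shape (T,)
    weights = F.softmax(sim, dim=0)    # weighting coefficients a_jk or b_jl
    pooled = weights @ time_feats      # pooled feature f_j or g_j, shape (d,)
    return weights, pooled

w_j = torch.randn(128)            # sentence feature w_j
v = torch.randn(20, 128)          # 20 frame image features v_k
a_jk, f_j = attend(w_j, v)        # weighting coefficients and video feature
```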

A retrieval system according to an eighth aspect is a retrieval system including a retrieval device and a training device, wherein the text feature extraction model employed in the retrieval device is a pre-trained text feature extraction model trained by the training device.

A recording medium according to a ninth aspect is a recording medium recorded with a retrieval program to cause a computer to execute processing. The processing includes: acquiring a search text; extracting a feature corresponding to the search text by inputting the acquired search text to a text feature extraction model configured to extract features from input sentences, the text feature extraction model being pre-trained so as to reduce a loss represented by a difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and also being pre-trained so as to reduce a loss represented by a difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior; computing a text distance for each of plural combinations stored in a database, each combination associating a text description including plural sentences, with a vehicle-view video, and with vehicle behavior data representing temporal vehicle behavior, the text distance being represented by a difference between a feature extracted from each sentence of the text description associated with the video and vehicle behavior data, and the feature corresponding to the search text; and outputting, as a search result, a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, according to the computed text distances.

A recording medium according to a tenth aspect is a recording medium recorded with a training program to cause a computer to execute processing for each of plural training data items associating a text description including plural sentences, with a vehicle-view video, and with vehicle behavior data representing temporal vehicle behavior. The processing includes: extracting a feature of a sentence of the training data by inputting the sentence to a text feature extraction model configured to extract features from input sentences; extracting a feature of the video corresponding to the same training data by inputting the video to a video feature extraction model configured to extract features from input video; computing a first loss represented by a difference between the sentence feature and the video feature; extracting a feature of a sentence of the training data item by inputting the sentence to the text feature extraction model; extracting a feature of the vehicle behavior data corresponding to the same training data by inputting the vehicle behavior data to a vehicle behavior feature extraction model configured to extract features from input vehicle behavior data; computing a second loss represented by a difference between the sentence feature and the vehicle behavior data feature; computing an overall loss function unifying the first loss with the second loss; executing first training processing to train the text feature extraction model and the video feature extraction model so as to reduce the computed overall loss function; executing second training processing to train the text feature extraction model and the vehicle behavior feature extraction model so as to reduce the computed overall loss function; and obtaining a pre-trained text feature extraction model by causing the first training processing and the second training processing to be performed repeatedly until the computed overall loss function becomes smaller than a prescribed threshold.

As described above, the technology disclosed herein exhibits the advantageous effect of enabling the retrieval of video and vehicle behavior data pairs corresponding to a driving scene described by search text.

The disclosure of Japanese Patent Application No. 2019-138287, filed on Jul. 26, 2019, is incorporated herein by reference in its entirety.

All publications, patent applications, and technical standards mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent application, or technical standard was specifically and individually indicated to be incorporated by reference.

The invention claimed is:
 1. A retrieval device, comprising: a memory, and a processor coupled to the memory, the processor being configured to: acquire a search text, extract a feature corresponding to the search text by inputting the search text to a text feature extraction model configured to extract features from input sentences, the text feature extraction model being pre-trained so as to reduce a loss represented by a difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and also being pre-trained so as to reduce a loss represented by a difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior, compute a text distance for each of a plurality of combinations stored in the memory, each combination associating a text description, including a plurality of sentences, with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, the text distance being represented by a difference between a feature extracted from each sentence of the text description associated with the video and the vehicle behavior data, and the feature corresponding to the search text, and output, as a search result, a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, in accordance with all text distances.
 2. The retrieval device of claim 1, wherein the processor is configured to: extract a feature $Q_1$ of a first sentence $q_1$, which is a sentence in the search text that describes a video, by inputting the first sentence $q_1$ into the text feature extraction model; extract a feature $Q_2$ of a second sentence $q_2$, which is a sentence in the search text that describes vehicle behavior data, by inputting the second sentence $q_2$ into the text feature extraction model; compute the text distance, for each of a plurality of training data items stored in the memory, according to a difference between the feature $Q_1$ of the first sentence $q_1$ and a feature $W_{j_1}^{i}$ of a $j_1$-th sentence of a text description associated with an $i$-th video, and according to a difference between the feature $Q_2$ of the second sentence $q_2$ and a feature $\tilde{W}_{j_2}^{i}$ of a $j_2$-th sentence of the text description associated with the $i$-th video stored in the memory; and output, as a search result, $N$ video and vehicle behavior data pairs in sequence from the smallest text distance.
 3. The retrieval device of claim 2, wherein, for each $n$-th ($1 \le n \le N$) pair included in the $N$ video and vehicle behavior data pairs, the processor is configured to output, as the search result, a pair of: frame images of a segment $[k_s^{(n)}, k_e^{(n)}]$ for which a weighting coefficient $a_{j_1(n)k}^{i}$ is larger than a threshold $\delta_1$, the weighting coefficient $a_{j_1(n)k}^{i}$ being in accordance with a similarity $s_{jk}^{i}$ between a feature of a $j_1(n)$-th sentence in the text description associated with the $i$-th video corresponding to the $n$-th pair and a feature of a frame image at a time instant $k$ in the $i$-th video; and a vehicle behavior of a segment $[l_s^{(n)}, l_e^{(n)}]$ for which a weighting coefficient $b_{j_2(n)l}^{i}$ is larger than a threshold $\delta_2$, the weighting coefficient $b_{j_2(n)l}^{i}$ being in accordance with a similarity $u_{j_2(n)l}^{i}$ between a feature of a $j_2(n)$-th sentence in the text description associated with the vehicle behavior data corresponding to the $i$-th video corresponding to the $n$-th pair and a feature of a vehicle behavior at a time instant $l$ in the vehicle behavior data corresponding to the $i$-th video.
 4. The retrieval device of claim 1, wherein the processor is configured to output, as the search result, $n^*$ pairs of the $N$ video and vehicle behavior data pairs, wherein the number $n^*$ has been preset by a user.
 5. A training device, comprising: a memory, and a processor coupled to the memory, the processor being configured, for each of a plurality of training data items associating a text description, including a plurality of sentences, with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, to: extract a feature of a sentence of training data by inputting the sentence to a text feature extraction model configured to extract features from input sentences, extract a feature of a video corresponding to the same training data by inputting the video to a video feature extraction model configured to extract features from input video, and compute a first loss represented by a difference between the sentence feature and the video feature; extract a feature of a sentence of the training data by inputting the sentence to the text feature extraction model, extract a feature of vehicle behavior data corresponding to the same training data by inputting the vehicle behavior data to a vehicle behavior feature extraction model configured to extract features from input vehicle behavior data, and compute a second loss represented by a difference between the sentence feature and the vehicle behavior data feature; compute an overall loss function unifying the first loss with the second loss; train the text feature extraction model and the video feature extraction model so as to reduce the overall loss function; train the text feature extraction model and the vehicle behavior feature extraction model so as to reduce the overall loss function; and obtain a pre-trained text feature extraction model by causing the training processing to be performed repeatedly until the computed overall loss function becomes smaller than a prescribed threshold.
 6. The training device of claim 5, wherein the processor is configured to: acquire a revamped sentence feature and a revamped video feature mapped into a same joint space, by inputting the sentence feature extracted by the text feature extraction model and the video feature extracted by the video feature extraction model to a first mapping model configured to map a plurality of different features into the same joint space, and compute a first loss represented by a difference between the revamped sentence feature and the revamped video feature; and acquire a revamped sentence feature and a revamped vehicle behavior data feature mapped into a same joint space, by inputting the sentence feature extracted by the text feature extraction model and the vehicle behavior data feature extracted by the vehicle behavior feature extraction model to a second mapping model configured to map a plurality of different features into the same joint space, and compute a second loss represented by a difference between the revamped sentence feature and the revamped vehicle behavior data feature.
 7. The training device of claim 6, wherein: the video feature extraction model includes an image feature extraction model configured to extract features from images, a first matching model configured to match sentence features against image features, and a first output model configured to output video features based on matching results output from the first matching model and the image features; the vehicle behavior feature extraction model includes a temporal feature extraction model configured to extract features from vehicle behavior at each time instant of the vehicle behavior data, a second matching model configured to match sentence features against vehicle behavior features, and a second output model configured to output vehicle behavior data features based on matching results output from the second matching model and the vehicle behavior features; for each of the plurality of training data items, the processor is configured to: extract a feature $v_k^i$ of a frame image at time instant $k$ of an $i$-th video of the training data, by inputting the frame image at the time instant $k$ in the $i$-th video to the image feature extraction model; extract a feature $w_j^i$ of a $j$-th sentence of a text description associated with the $i$-th video of the training data, by inputting the $j$-th sentence in the text description to the text feature extraction model; calculate a similarity $s_{jk}^i$ between the frame image at the time instant $k$ in the $i$-th video and the $j$-th sentence in the text description, by inputting, to the first matching model, a combination of the feature $v_k^i$ of the frame image at the time instant $k$ in the $i$-th video of the training data and the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video, and also calculate a weighting coefficient $a_{jk}^i$ in accordance with the similarity $s_{jk}^i$ as a matching result; acquire a video feature $f_j^i$ for the $j$-th sentence of the $i$-th video, by inputting, to the first output model, a combination of the weighting coefficient $a_{jk}^i$ that is the matching result for the $i$-th video of the training data and the feature $v_k^i$ of the frame image at the time instant $k$ in the $i$-th video; acquire a revamped video feature $F_j^i$ corresponding to the feature $f_j^i$ of the $i$-th video of the training data and a revamped sentence feature $W_j^i$ corresponding to the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video, by inputting, to the first mapping model, a combination of the video feature $f_j^i$ and the sentence feature $w_j^i$; and compute the first loss represented by a difference between the revamped video feature $F_j^i$ and the revamped sentence feature $W_j^i$; and for each of the plurality of training data items, the processor is configured to: extract a feature $c_l^i$ of the vehicle behavior at time instant $l$ in the $i$-th vehicle behavior data associated with the $i$-th video of the training data, by inputting the behavior at the time instant $l$ in the vehicle behavior data to the vehicle behavior feature extraction model; calculate a similarity $u_{jl}^i$ between the vehicle behavior at the time instant $l$ in the vehicle behavior data associated with the $i$-th video and the $j$-th sentence in the text description, by inputting, to the second matching model, a combination of the feature $c_l^i$ of the behavior at the time instant $l$ in the vehicle behavior data associated with the $i$-th video of the training data and the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video, and calculate a weighting coefficient $b_{jl}^i$ in accordance with the similarity $u_{jl}^i$ as a matching result; acquire a feature $g_j^i$ of the vehicle behavior data, by inputting, to the second output model, a plurality of combinations of the weighting coefficient $b_{jl}^i$ that is the matching result for the vehicle behavior data associated with the $i$-th video of the training data and the feature $c_l^i$ of the vehicle behavior at the time instant $l$ for the vehicle behavior data associated with the $i$-th video; acquire a revamped feature $G_j^i$ of the vehicle behavior data corresponding to the feature $g_j^i$ for the $j$-th sentence of the vehicle behavior data associated with the $i$-th video of the training data and a revamped sentence feature $\tilde{W}_j^i$ corresponding to the feature $w_j^i$ of the $j$-th sentence in the text description for the $i$-th video, by inputting, to the second mapping model, a combination of the vehicle behavior data feature $g_j^i$ and the sentence feature $w_j^i$; and compute the second loss represented by a difference between the revamped vehicle behavior data feature $G_j^i$ and the revamped sentence feature $\tilde{W}_j^i$.
 8. A retrieval system, comprising: a retrieval device comprising: a memory, and a processor coupled to the memory, the processor being configured to: acquire a search text, extract a feature corresponding to the search text by inputting the search text to a text feature extraction model configured to extract features from input sentences, the text feature extraction model being pre-trained so as to reduce a loss represented by a difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and also being pre-trained so as to reduce a loss represented by a difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior, compute a text distance for each of a plurality of combinations stored in the memory, each combination associating a text description, including a plurality of sentences, with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, the text distance being represented by a difference between a feature extracted from each sentence of the text description associated with the video and the vehicle behavior data, and the feature corresponding to the search text, and output, as a search result, a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, in accordance with all text distances; and a training device comprising: a memory, and a processor coupled to the memory, the processor being configured, for each of a plurality of training data items associating a text description, including a plurality of sentences, with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, to: extract a feature of a sentence of training data by inputting the sentence to a text feature extraction model configured to extract features from input sentences, extract a feature of a video corresponding to the same training data by inputting the video to a video feature extraction model configured to extract features from input video, and compute a first loss represented by a difference between the sentence feature and the video feature; extract a feature of a sentence of the training data by inputting the sentence to the text feature extraction model, extract a feature of vehicle behavior data corresponding to the same training data by inputting the vehicle behavior data to a vehicle behavior feature extraction model configured to extract features from input vehicle behavior data, and compute a second loss represented by a difference between the sentence feature and the vehicle behavior data feature; compute an overall loss function unifying the first loss with the second loss; train the text feature extraction model and the video feature extraction model so as to reduce the overall loss function; train the text feature extraction model and the vehicle behavior feature extraction model so as to reduce the overall loss function; and obtain a pre-trained text feature extraction model by causing the training processing to be performed repeatedly until the computed overall loss function becomes smaller than a prescribed threshold, wherein the text feature extraction model employed in the retrieval device is the pre-trained text feature extraction model trained by the training device.
 9. A non-transitory recording medium recorded with a retrieval program to cause a computer to execute processing, the processing comprising: acquiring a search text; extracting a feature corresponding to the search text by inputting the acquired search text to a text feature extraction model configured to extract features from input sentences, the text feature extraction model being pre-trained so as to reduce a loss represented by a difference between a feature extracted from a sentence and a feature extracted from a correctly matched vehicle-view video, and also being pre-trained so as to reduce a loss represented by a difference between a feature extracted from the sentence and a feature extracted from correctly matched vehicle behavior data representing temporal vehicle behavior; computing a text distance for each of a plurality of combinations stored in a database, each combination associating a text description, including a plurality of sentences, with a vehicle-view video and with vehicle behavior data representing temporal vehicle behavior, the text distance being represented by a difference between a feature extracted from each sentence of the text description associated with the video and vehicle behavior data, and the feature corresponding to the search text; and outputting, as a search result, a prescribed number of video and vehicle behavior data pairs in sequence from the smallest text distance, in accordance with the computed text distances.